Overview

Dataset statistics

Number of variables: 12
Number of observations: 691
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 64.9 KiB
Average record size in memory: 96.2 B

Variable types

NUM: 11
BOOL: 1

Reproduction

Analysis started: 2020-06-17 14:04:22.803543
Analysis finished: 2020-06-17 14:05:15.908355
Duration: 53.1 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Uniformity of Cell Shape is highly correlated with Uniformity of Cell Size | High correlation
Uniformity of Cell Size is highly correlated with Uniformity of Cell Shape | High correlation
df_index has unique values | Unique
Bare Nuclei has 413 (59.8%) zeros | Zeros

Variables

df_index
Real number (ℝ≥0)

UNIQUE

Distinct count: 691
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 348.9479015918958
Minimum: 0
Maximum: 698
Zeros: 1
Zeros (%): 0.1%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 0
5-th percentile: 34.5
Q1: 172.5
median: 351
Q3: 523.5
95-th percentile: 662.5
Maximum: 698
Range: 698
Interquartile range (IQR): 351

Descriptive statistics

Standard deviation: 202.3461166
Coefficient of variation (CV): 0.5798748629
Kurtosis: -1.207385754
Mean: 348.9479016
Median Absolute Deviation (MAD): 176
Skewness: -0.006482664191
Sum: 241123
Variance: 40943.95091
Histogram with fixed size bins (bins=10)
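Value | Count | Frequency (%)
698 | 1 | 0.1%
227 | 1 | 0.1%
235 | 1 | 0.1%
234 | 1 | 0.1%
233 | 1 | 0.1%
232 | 1 | 0.1%
231 | 1 | 0.1%
230 | 1 | 0.1%
229 | 1 | 0.1%
228 | 1 | 0.1%
Other values (681) | 681 | 98.6%

Value | Count | Frequency (%)
0 | 1 | 0.1%
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%

Value | Count | Frequency (%)
698 | 1 | 0.1%
697 | 1 | 0.1%
696 | 1 | 0.1%
695 | 1 | 0.1%
694 | 1 | 0.1%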

ID
Real number (ℝ≥0)

Distinct count: 645
Unique (%): 93.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1073333.4356005788
Minimum: 61634
Maximum: 13454352
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 61634
5-th percentile: 411876.5
Q1: 872549
median: 1171710
Q3: 1238437
95-th percentile: 1333946
Maximum: 13454352
Range: 13392718
Interquartile range (IQR): 365888

Descriptive statistics

Standard deviation: 619295.2971
Coefficient of variation (CV): 0.5769831411
Kurtosis: 256.8890813
Mean: 1073333.436
Median Absolute Deviation (MAD): 104381
Skewness: 13.68366131
Sum: 741673404
Variance: 3.83526665e+11
Histogram with fixed size bins (bins=10)
Value | Count | Frequency (%)
1182404 | 6 | 0.9%
1276091 | 5 | 0.7%
734111 | 2 | 0.3%
1240603 | 2 | 0.3%
385103 | 2 | 0.3%
560680 | 2 | 0.3%
1174057 | 2 | 0.3%
822829 | 2 | 0.3%
1198641 | 2 | 0.3%
1114570 | 2 | 0.3%
Other values (635) | 664 | 96.1%

Value | Count | Frequency (%)
61634 | 1 | 0.1%
63375 | 1 | 0.1%
76389 | 1 | 0.1%
95719 | 1 | 0.1%
128059 | 1 | 0.1%

Value | Count | Frequency (%)
13454352 | 1 | 0.1%
8233704 | 1 | 0.1%
1371920 | 1 | 0.1%
1371026 | 1 | 0.1%
1369821 | 1 | 0.1%

Clump Thickness
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.426917510853835
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 4
Q3: 6
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.815860703
Coefficient of variation (CV): 0.6360770663
Kurtosis: -0.6242633153
Mean: 4.426917511
Median Absolute Deviation (MAD): 2
Skewness: 0.5915025806
Sum: 3059
Variance: 7.929071499
Histogram with fixed size bins (bins=10)
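Value | Count | Frequency (%)
1 | 142 | 20.5%
5 | 129 | 18.7%
3 | 106 | 15.3%
4 | 80 | 11.6%
10 | 69 | 10.0%
2 | 50 | 7.2%
8 | 46 | 6.7%
6 | 33 | 4.8%
7 | 23 | 3.3%
9 | 13 | 1.9%

Value | Count | Frequency (%)
1 | 142 | 20.5%
2 | 50 | 7.2%
3 | 106 | 15.3%
4 | 80 | 11.6%
5 | 129 | 18.7%

Value | Count | Frequency (%)
10 | 69 | 10.0%
9 | 13 | 1.9%
8 | 46 | 6.7%
7 | 23 | 3.3%
6 | 33 | 4.8%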

Uniformity of Cell Size
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.130246020260492
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 3.041328337
Coefficient of variation (CV): 0.9715940273
Kurtosis: 0.1063917487
Mean: 3.13024602
Median Absolute Deviation (MAD): 0
Skewness: 1.233241438
Sum: 2163
Variance: 9.249678055
Histogram with fixed size bins (bins=10)
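Value | Count | Frequency (%)
1 | 379 | 54.8%
10 | 65 | 9.4%
3 | 51 | 7.4%
2 | 45 | 6.5%
4 | 40 | 5.8%
5 | 30 | 4.3%
8 | 29 | 4.2%
6 | 27 | 3.9%
7 | 19 | 2.7%
9 | 6 | 0.9%

Value | Count | Frequency (%)
1 | 379 | 54.8%
2 | 45 | 6.5%
3 | 51 | 7.4%
4 | 40 | 5.8%
5 | 30 | 4.3%

Value | Count | Frequency (%)
10 | 65 | 9.4%
9 | 6 | 0.9%
8 | 29 | 4.2%
7 | 19 | 2.7%
6 | 27 | 3.9%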

Uniformity of Cell Shape
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.2011577424023154
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 5
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 4

Descriptive statistics

Standard deviation: 2.959886179
Coefficient of variation (CV): 0.924629905
Kurtosis: 0.01849497393
Mean: 3.201157742
Median Absolute Deviation (MAD): 0
Skewness: 1.163698359
Sum: 2212
Variance: 8.760926194
Histogram with fixed size bins (bins=10)
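Value | Count | Frequency (%)
1 | 348 | 50.4%
2 | 59 | 8.5%
10 | 56 | 8.1%
3 | 56 | 8.1%
4 | 44 | 6.4%
5 | 33 | 4.8%
7 | 30 | 4.3%
6 | 30 | 4.3%
8 | 28 | 4.1%
9 | 7 | 1.0%

Value | Count | Frequency (%)
1 | 348 | 50.4%
2 | 59 | 8.5%
3 | 56 | 8.1%
4 | 44 | 6.4%
5 | 33 | 4.8%

Value | Count | Frequency (%)
10 | 56 | 8.1%
9 | 7 | 1.0%
8 | 28 | 4.1%
7 | 30 | 4.3%
6 | 30 | 4.3%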

Marginal Adhesion
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.824891461649783
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.8665517
Coefficient of variation (CV): 1.014747554
Kurtosis: 0.9320711039
Mean: 2.824891462
Median Absolute Deviation (MAD): 0
Skewness: 1.507562523
Sum: 1952
Variance: 8.217118648
Histogram with fixed size bins (bins=10)
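Value | Count | Frequency (%)
1 | 401 | 58.0%
3 | 58 | 8.4%
2 | 56 | 8.1%
10 | 55 | 8.0%
4 | 33 | 4.8%
8 | 25 | 3.6%
5 | 23 | 3.3%
6 | 22 | 3.2%
7 | 13 | 1.9%
9 | 5 | 0.7%

Value | Count | Frequency (%)
1 | 401 | 58.0%
2 | 56 | 8.1%
3 | 58 | 8.4%
4 | 33 | 4.8%
5 | 23 | 3.3%

Value | Count | Frequency (%)
10 | 55 | 8.0%
9 | 5 | 0.7%
8 | 25 | 3.6%
7 | 13 | 1.9%
6 | 22 | 3.2%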

Single Epithelial Cell Size
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.211287988422576
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 2
Q3: 4
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 2.199852417
Coefficient of variation (CV): 0.6850374134
Kurtosis: 2.215127492
Mean: 3.211287988
Median Absolute Deviation (MAD): 0
Skewness: 1.719041483
Sum: 2219
Variance: 4.839350658
Histogram with fixed size bins (bins=10)
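Value | Count | Frequency (%)
2 | 383 | 55.4%
3 | 71 | 10.3%
4 | 48 | 6.9%
1 | 45 | 6.5%
6 | 41 | 5.9%
5 | 39 | 5.6%
10 | 30 | 4.3%
8 | 20 | 2.9%
7 | 12 | 1.7%
9 | 2 | 0.3%

Value | Count | Frequency (%)
1 | 45 | 6.5%
2 | 383 | 55.4%
3 | 71 | 10.3%
4 | 48 | 6.9%
5 | 39 | 5.6%

Value | Count | Frequency (%)
10 | 30 | 4.3%
9 | 2 | 0.3%
8 | 20 | 2.9%
7 | 12 | 1.7%
6 | 41 | 5.9%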

Bare Nuclei
Real number (ℝ≥0)

ZEROS

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.1881331403762663
Minimum: 0
Maximum: 9
Zeros: 413
Zeros (%): 59.8%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 1
95-th percentile: 7
Maximum: 9
Range: 9
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 2.128326107
Coefficient of variation (CV): 1.791319537
Kurtosis: 3.86311774
Mean: 1.18813314
Median Absolute Deviation (MAD): 0
Skewness: 2.141039472
Sum: 821
Variance: 4.529772017
Histogram with fixed size bins (bins=10)
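Value | Count | Frequency (%)
0 | 413 | 59.8%
1 | 130 | 18.8%
5 | 30 | 4.3%
2 | 30 | 4.3%
3 | 28 | 4.1%
8 | 20 | 2.9%
4 | 19 | 2.7%
9 | 9 | 1.3%
7 | 8 | 1.2%
6 | 4 | 0.6%

Value | Count | Frequency (%)
0 | 413 | 59.8%
1 | 130 | 18.8%
2 | 30 | 4.3%
3 | 28 | 4.1%
4 | 19 | 2.7%

Value | Count | Frequency (%)
9 | 9 | 1.3%
8 | 20 | 2.9%
7 | 8 | 1.2%
6 | 4 | 0.6%
5 | 30 | 4.3%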

Bland Chromatin
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.435600578871201
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 5
95-th percentile: 8
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.442345103
Coefficient of variation (CV): 0.710893204
Kurtosis: 0.1898953894
Mean: 3.435600579
Median Absolute Deviation (MAD): 1
Skewness: 1.102753118
Sum: 2374
Variance: 5.965049603
Histogram with fixed size bins (bins=10)
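Value | Count | Frequency (%)
2 | 165 | 23.9%
3 | 161 | 23.3%
1 | 151 | 21.9%
7 | 71 | 10.3%
4 | 40 | 5.8%
5 | 34 | 4.9%
8 | 28 | 4.1%
10 | 20 | 2.9%
9 | 11 | 1.6%
6 | 10 | 1.4%

Value | Count | Frequency (%)
1 | 151 | 21.9%
2 | 165 | 23.9%
3 | 161 | 23.3%
4 | 40 | 5.8%
5 | 34 | 4.9%

Value | Count | Frequency (%)
10 | 20 | 2.9%
9 | 11 | 1.6%
8 | 28 | 4.1%
7 | 71 | 10.3%
6 | 10 | 1.4%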

Normal Nucleoli
Real number (ℝ≥0)

Distinct count: 10
Unique (%): 1.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.882778581765557
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 4
95-th percentile: 10
Maximum: 10
Range: 9
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 3.066297646
Coefficient of variation (CV): 1.063660479
Kurtosis: 0.4263861649
Mean: 2.882778582
Median Absolute Deviation (MAD): 0
Skewness: 1.407485407
Sum: 1992
Variance: 9.402181254
Histogram with fixed size bins (bins=10)
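Value | Count | Frequency (%)
1 | 437 | 63.2%
10 | 61 | 8.8%
3 | 42 | 6.1%
2 | 36 | 5.2%
8 | 24 | 3.5%
6 | 22 | 3.2%
5 | 19 | 2.7%
4 | 18 | 2.6%
9 | 16 | 2.3%
7 | 16 | 2.3%

Value | Count | Frequency (%)
1 | 437 | 63.2%
2 | 36 | 5.2%
3 | 42 | 6.1%
4 | 18 | 2.6%
5 | 19 | 2.7%

Value | Count | Frequency (%)
10 | 61 | 8.8%
9 | 16 | 2.3%
8 | 24 | 3.5%
7 | 16 | 2.3%
6 | 22 | 3.2%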

Mitoses
Real number (ℝ≥0)

Distinct count: 9
Unique (%): 1.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1.593342981186686
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Memory size: 5.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 1
95-th percentile: 5
Maximum: 10
Range: 9
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.723128843
Coefficient of variation (CV): 1.081455069
Kurtosis: 12.51342181
Mean: 1.593342981
Median Absolute Deviation (MAD): 0
Skewness: 3.544555352
Sum: 1101
Variance: 2.969173011
Histogram with fixed size bins (bins=10)
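Value | Count | Frequency (%)
1 | 572 | 82.8%
2 | 35 | 5.1%
3 | 32 | 4.6%
10 | 14 | 2.0%
4 | 12 | 1.7%
7 | 9 | 1.3%
8 | 8 | 1.2%
5 | 6 | 0.9%
6 | 3 | 0.4%

Value | Count | Frequency (%)
1 | 572 | 82.8%
2 | 35 | 5.1%
3 | 32 | 4.6%
4 | 12 | 1.7%
5 | 6 | 0.9%

Value | Count | Frequency (%)
10 | 14 | 2.0%
8 | 8 | 1.2%
7 | 9 | 1.3%
6 | 3 | 0.4%
5 | 6 | 0.9%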

Class
Boolean

Distinct count: 2
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 5.4 KiB

Value | Count | Frequency (%)
1 | 453 | 65.6%
0 | 238 | 34.4%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
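As a minimal sketch of the calculation described above, the following pure-Python function divides the covariance by the product of the standard deviations. The function name `pearson_r` and the sample lists are illustrative assumptions, not part of the report or of pandas-profiling; the 1/n factors of the covariance and the standard deviations cancel, so sums of deviations are used directly.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of (cross-)deviations; the common 1/n factor cancels in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: shifting and rescaling x (with a positive factor) leaves r unchanged.
x = [1, 2, 4, 7]
y = [3, 1, 5, 2]
print(pearson_r(x, y))
print(pearson_r([10 * a + 3 for a in x], y))
```

The two printed values should agree up to floating-point rounding, which illustrates the invariance under changes in location and scale mentioned above.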

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
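The rank-based calculation can be sketched as follows: replace each value by its rank, then apply the covariance-over-standard-deviations formula to the ranks. The helper names (`average_ranks`, `spearman_rho`) and the choice of average ranks for ties are assumptions made for this illustration; the report does not state how ties were handled.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: y = x**3 is monotonic but not linear in x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y))
```

Because the relationship in the sample data is strictly increasing, the ranks of x and y coincide, which is the case the text describes as total positive monotonic correlation.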

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
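The pair-counting definition above corresponds to the variant usually called tau-a, sketched below. The function name `kendall_tau_a` is an assumption for illustration. Libraries such as SciPy default to the tie-corrected tau-b variant, so results can differ from this sketch when the data contain ties; the report does not say which variant it used.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 values")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both variables order the pair the same way
        elif sign < 0:
            discordant += 1  # the variables order the pair in opposite ways
        # Pairs tied in x or y count as neither concordant nor discordant.
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: two of the three pairs are concordant, one is discordant.
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))
```

This quadratic-time loop is meant to mirror the definition; it is not tuned for large inputs.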

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.

Missing values

Sample

First rows

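 | df_index | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
0 | 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 0 | 3 | 1 | 1 | 1
1 | 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 1 | 3 | 2 | 1 | 1
2 | 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 1
3 | 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 1
4 | 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 0 | 3 | 1 | 1 | 1
5 | 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 1 | 9 | 7 | 1 | 0
6 | 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1
7 | 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 0 | 3 | 1 | 1 | 1
8 | 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 5 | 1
9 | 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 0 | 2 | 1 | 1 | 1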

Last rows

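 | df_index | ID | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
681 | 689 | 654546 | 1 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 8 | 1
682 | 690 | 654546 | 1 | 1 | 1 | 3 | 2 | 0 | 1 | 1 | 1 | 1
683 | 691 | 695091 | 5 | 10 | 10 | 5 | 4 | 5 | 4 | 4 | 1 | 0
684 | 692 | 714039 | 3 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1
685 | 693 | 763235 | 3 | 1 | 1 | 1 | 2 | 0 | 2 | 1 | 2 | 1
686 | 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 1
687 | 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 1
688 | 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 0
689 | 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 0
690 | 698 | 897471 | 4 | 8 | 8 | 5 | 4 | 5 | 10 | 4 | 1 | 0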